
CBO Phase 3: HyperLogLog NDV Estimator #82

Merged
poyrazK merged 7 commits into main from feature/hll-ndv
May 14, 2026

Conversation

Owner

@poyrazK poyrazK commented May 12, 2026

Summary

  • Add HyperLogLog class (include/common/hll.hpp) — memory-bounded probabilistic NDV estimator (~12KB per column vs unbounded unordered_set)
  • Integrate HLL into execute_analyze() replacing unordered_set<string> for NDV collection
  • Add hll_test.cpp unit tests (10 tests passing)
  • Fix ExecutionTests.AnalyzeTable to use EXPECT_GE instead of exact equality (HLL is probabilistic)

Implementation Details

  • Hash: FNV-1a (replaces djb2 which had poor upper-bit distribution)
  • Index: bottom 11 bits of hash → register index (0-2047)
  • Register value: trailing zeros in upper 53 bits + 1
  • Small cardinality fix: linear counting fallback when nonzero registers < m/20

Test Results

  • 37/38 tests pass (server_tests SIGPIPE is pre-existing)
  • HLL tests: 10/10 pass
  • AnalyzeTable and AnalyzeTableLargeRows pass

Note

The branch was created from main and pushed successfully. Ready for review.

Summary by CodeRabbit

  • New Features
    • Added a fixed-size HyperLogLog distinct-count estimator and switched ANALYZE TABLE to use memory-bounded probabilistic sketches for NDV estimation.
  • Tests
    • Extensive unit and integration tests for the sketch (accuracy bounds, merge, seed behavior, reset, hashing) and updated an existing NDV assertion to a probabilistic lower bound.
  • Chores
    • Registered a new test executable in the test suite.



coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Adds a header-only HyperLogLog implementation, uses per-column HLL sketches in QueryExecutor::execute_analyze to estimate NDV, adds unit tests for HLL, adjusts an integration test to accept probabilistic NDV, and registers a new CMake test target hll_tests.

Changes

HyperLogLog Cardinality Estimation and ANALYZE TABLE

  • HyperLogLog library implementation (include/common/hll.hpp): New cloudsql::common::HyperLogLog header: 2048-register sketch, insert(), cardinality() with linear-counting and bias adjustment, reset(), merge(), and hash_bytes() (FNV-1a).
  • QueryExecutor::execute_analyze integration (src/executor/query_executor.cpp): Replaces per-column unordered_set NDV tracking with std::vector<common::HyperLogLog>; hashes values by type during the table scan and inserts into the HLL; derives column NDV via HyperLogLog::cardinality() before updating catalog stats.
  • HyperLogLog unit & integration tests (tests/hll_test.cpp): New GoogleTest suite covering empty/non-empty cardinality, insert behavior, distinct-value streams, hash_bytes determinism/difference, reset(), merge(), seed reproducibility/difference, accuracy sanity bounds, and value-type insertion coverage.
  • Integration test adjustment (tests/cloudSQL_tests.cpp): Updates ExecutionTests.AnalyzeTable to assert a probabilistic lower bound (EXPECT_GE) for text-column NDV instead of exact equality.
  • CMake test target registration (CMakeLists.txt): Adds add_cloudsql_test(hll_tests tests/hll_test.cpp) under BUILD_TESTS to build and register the new test executable.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Possibly related PRs

  • poyrazK/cloudSQL#2: Adds GoogleTest/CMake test registration macros used to register the new hll_tests target.
  • poyrazK/cloudSQL#81: Modifies CMakeLists.txt's BUILD_TESTS test registrations (similar test-target additions).
  • poyrazK/cloudSQL#75: Also modifies QueryExecutor::execute_analyze(...); overlaps with analyze/NDV changes.

Poem

🐰 I hopped in with registers bright and small,
Counting uniques without storing them all,
I hash each byte with FNV's tune,
Merge and reset beneath the moon,
ANALYZE now hums an estimate for all.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: Passed (skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title 'CBO Phase 3: HyperLogLog NDV Estimator' directly and clearly summarizes the main change: adding a HyperLogLog implementation for number-of-distinct-values (NDV) estimation as part of the query optimizer's cost-based optimization (CBO) work, which is confirmed by all modified files.
  • Docstring Coverage: Passed. Coverage is 91.67%, above the required threshold of 80.00%.
  • Linked Issues Check: Passed (skipped because no linked issues were found for this pull request).
  • Out of Scope Changes Check: Passed (skipped because no linked issues were found for this pull request).


poyrazK added 2 commits May 12, 2026 18:25
Adds new hll_tests executable for HyperLogLog unit testing.
Replaces std::unordered_set<std::string> NDV collection with
std::vector<common::HyperLogLog>. Uses FNV-1a hash for text (better
upper-bit distribution than djb2) and Value::Hash for numeric types.
Linear counting fallback for small cardinalities prevents HLL's
extreme overestimation when few registers are used.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/executor/query_executor.cpp (1)

307-324: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Route ANALYZE TABLE through execute_analyze(...).

execute_analyze(...) is implemented below, but this dispatcher never calls it, so ANALYZE TABLE will still fall into "Unsupported statement type".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 307 - 324, The dispatcher in
execute (the if/else chain handling parser::StmtType cases) is missing a branch
for parser::StmtType::Analyze; add an else-if branch that calls execute_analyze
with the statement cast to the correct type (e.g., dynamic_cast<const
parser::AnalyzeStatement&>(stmt)) and pass the current transaction object (txn)
if execute_analyze expects it, mirroring how execute_select/execute_insert/etc.
are invoked so ANALYZE TABLE routes to execute_analyze instead of falling
through to "Unsupported statement type".
🧹 Nitpick comments (1)
tests/hll_test.cpp (1)

56-165: ⚡ Quick win

Strengthen these assertions beyond “non-zero”.

Most of this suite still passes if cardinality() returns 1 for every non-empty sketch, so it won't catch broken HLL math. For a deterministic input stream, add a few bounded accuracy checks and verify merged sketches estimate at least as large as each input sketch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/hll_test.cpp` around lines 56 - 165, Replace the weak "non-zero"
assertions with bounded-accuracy checks: for tests that insert N distinct items
(e.g., the loops in DistinctValuesProduceCardinality,
DistinctValueSetsProduceCardinality, TextValueInsertion,
MergeCombinesDistinctSets) assert the returned HyperLogLog::cardinality() lies
within a reasonable relative tolerance of N (for example N*0.5 <= card <= N*1.5)
or use a configurable relative error constant (e.g., 1.6% * k or a fixed 50% for
small N) and assert merge preserves/raises counts (after hll1.merge(hll2) assert
hll1.cardinality() >= old_hll1_cardinality and >= old_hll2_cardinality);
similarly replace the single zero-check after reset to EXPECT_EQ(cardinality(),
0U) remains but ensure before reset you capture and assert expected range; use
HyperLogLog::insert, cardinality(), merge(), reset(), and hash_bytes() as
anchors when editing each test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/common/hll.hpp`:
- Around line 122-125: merge() silently combines HyperLogLog sketches with
different hash seeds which corrupts results; change merge(const HyperLogLog&
other) noexcept to remove noexcept and add a check that other.seed_ == seed_,
and if not, throw a descriptive std::invalid_argument (or std::runtime_error)
mentioning mismatched seeds; if seeds match, proceed with the existing loop over
kNumRegisters updating registers_.at(i). Ensure you reference and update the
function signature and any callers to handle the exception.
- Around line 83-111: The sparse-case and raw HLL formulas are wrong (they use
log2 and lack alpha scaling); replace the linear counting and raw estimate
calculations so linear counting uses the natural log (std::log) with V =
empty_count and E = -m * std::log(static_cast<double>(empty_count) / m), and
replace raw_estimate with the HLL estimator raw_estimate = alpha_m * m * m / sum
where alpha_m is chosen per m (compute alpha_m using the standard small-m
constants or formula for get_alpha(m)); keep existing bounds handling for
estimate, casting and use kMaxCardinality as before (refer to nonzero_count, m,
empty_count, sum, raw_estimate, estimate, kMaxCardinality).

In `@src/executor/query_executor.cpp`:
- Around line 955-961: The text branch currently hashes only a 64-byte prefix
(see col_info.type checks and val.as_text()) which undercounts NDV; change it to
hash the entire string by passing the full length of val.as_text() to
common::HyperLogLog::hash_bytes (replace the prefix_len logic and use s.size()),
ensuring the computed hash value assigned to variable hash covers the whole text
for ValueType::TYPE_TEXT/TYPE_VARCHAR/TYPE_CHAR cases.

In `@tests/cloudSQL_tests.cpp`:
- Around line 1263-1610: The file contains unresolved git conflict markers
surrounding a large block of new tests (e.g., TEST(ExecutionTests,
AnalyzeTable), AnalyzeTableLargeRows, AnalyzeTableNonExistent,
AnalyzeTableEmpty, AnalyzeTableWithNulls, AnalyzeTableStringStats,
AnalyzeFilterSelectivity, AnalyzeJoinOrder); remove the conflict markers
(<<<<<<<, =======, >>>>>>>) and keep the intended test additions (the ANALYZE
TABLE test cases) as a single coherent block, ensuring only one copy of each
TEST(...) remains and the surrounding includes/namespace usages remain valid so
the file compiles.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec6e791d-6dfa-461f-bae4-f36dbc2e255b

📥 Commits

Reviewing files that changed from the base of the PR and between 18d5d03 and e788277.

📒 Files selected for processing (5)
  • CMakeLists.txt
  • include/common/hll.hpp
  • src/executor/query_executor.cpp
  • tests/cloudSQL_tests.cpp
  • tests/hll_test.cpp

Comment thread include/common/hll.hpp
Comment on lines +83 to +111
        // For sparse data (few registers used), use linear counting to avoid
        // HLL's extreme overestimation. When registers are sparse (nonzero < m/20),
        // the HLL raw formula gives wildly incorrect results.
        if (nonzero_count < static_cast<int>(m / 20)) {
            // Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction
            // Using simple form without alpha scaling for very sparse data
            double linear_est = -m * std::log2(static_cast<double>(empty_count) / m);
            return static_cast<uint64_t>(std::max(1.0, linear_est));
        }

        // Standard HLL formula for moderate to large cardinalities
        double raw_estimate = m * std::log2(m / sum);

        // Bias correction for small cardinalities
        // HLL systematically overestimates for small n; apply downward bias
        double bias = 0.0;
        if (raw_estimate <= 2.5 * m) {
            bias = -0.5 * (m / 10.0);
        }

        double estimate = raw_estimate + bias;

        if (estimate < 0) {
            return 0;
        }
        if (estimate > static_cast<double>(kMaxCardinality)) {
            return kMaxCardinality;
        }
        return static_cast<uint64_t>(estimate);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the actual HLL estimators here.

Both estimates are using log2(...), but standard HLL needs linear counting with the natural log for the sparse case and alpha_m * m * m / sum for the raw estimate. As written, this will store materially biased NDV stats for ANALYZE TABLE.

Suggested fix
-        if (nonzero_count < static_cast<int>(m / 20)) {
-            // Linear counting: E[n] ≈ -m * log(V/m) where V = empty fraction
-            // Using simple form without alpha scaling for very sparse data
-            double linear_est = -m * std::log2(static_cast<double>(empty_count) / m);
-            return static_cast<uint64_t>(std::max(1.0, linear_est));
-        }
-
-        // Standard HLL formula for moderate to large cardinalities
-        double raw_estimate = m * std::log2(m / sum);
+        const double alpha_m = 0.7213 / (1.0 + 1.079 / m);
+        double raw_estimate = alpha_m * m * m / sum;
+        if (empty_count > 0 && raw_estimate <= 2.5 * m) {
+            const double linear_est = m * std::log(m / static_cast<double>(empty_count));
+            return static_cast<uint64_t>(std::llround(linear_est));
+        }

Comment thread include/common/hll.hpp
Comment on lines +122 to +125
    void merge(const HyperLogLog& other) noexcept {
        for (size_t i = 0; i < kNumRegisters; ++i) {
            registers_.at(i) = std::max(registers_.at(i), other.registers_.at(i));
        }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Block merges across different seeds.

merge() currently combines sketches even when seed_ differs. That silently unions registers from different hash domains, so the resulting cardinality is meaningless.


Comment on lines +955 to +961
                if (col_info.type == common::ValueType::TYPE_TEXT ||
                    col_info.type == common::ValueType::TYPE_VARCHAR ||
                    col_info.type == common::ValueType::TYPE_CHAR) {
                    // Use 64-char prefix for text hashing
                    const std::string& s = val.as_text();
                    size_t prefix_len = std::min(s.size(), size_t(64));
                    hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hash the full text value for NDV.

Using only the first 64 characters makes all longer strings with a shared prefix collide into the same HLL stream, which will systematically undercount NDV for common text patterns.

Suggested fix
-                    // Use 64-char prefix for text hashing
                     const std::string& s = val.as_text();
-                    size_t prefix_len = std::min(s.size(), size_t(64));
-                    hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len);
+                    hash = common::HyperLogLog::hash_bytes(s.data(), s.size());

Comment thread tests/cloudSQL_tests.cpp Outdated
poyrazK added 2 commits May 12, 2026 18:27
New files:
- include/common/hll.hpp: HLL class with FNV-1a hash, bottom-k index,
  trailing zeros register values, linear counting for small n
- tests/hll_test.cpp: 10 unit tests covering basic operations,
  hash consistency, reset, merge, and text value insertion
HLL is probabilistic and may estimate 4 instead of 3 for small
cardinalities. Use >= instead of exact equality.
@poyrazK poyrazK force-pushed the feature/hll-ndv branch from e788277 to 1c31535 Compare May 12, 2026 15:30

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
src/executor/query_executor.cpp (1)

1019-1022: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Hash the full text value for NDV.

Using only the first 64 characters will cause all longer strings with a shared prefix to collide into the same HLL stream, systematically undercounting NDV for text columns with common prefixes.

🔧 Proposed fix
-                    // Use 64-char prefix for text hashing
                     const std::string& s = val.as_text();
-                    size_t prefix_len = std::min(s.size(), size_t(64));
-                    hash = common::HyperLogLog::hash_bytes(s.data(), prefix_len);
+                    hash = common::HyperLogLog::hash_bytes(s.data(), s.size());
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 1019 - 1022, The NDV hashing
currently only uses the first 64 bytes of the text (variable s and prefix_len)
which causes collisions for long strings; change the call to
common::HyperLogLog::hash_bytes to hash the entire string length (use s.size() /
full length) instead of std::min(..., 64) so that the full text value is passed
to hash_bytes (update the variables around s, prefix_len and the hash assignment
in query_executor.cpp accordingly).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71bc574a-2691-44ae-baa8-a3093707a8d7

📥 Commits

Reviewing files that changed from the base of the PR and between e788277 and 6482641.

📒 Files selected for processing (5)
  • CMakeLists.txt
  • include/common/hll.hpp
  • src/executor/query_executor.cpp
  • tests/cloudSQL_tests.cpp
  • tests/hll_test.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • CMakeLists.txt
  • include/common/hll.hpp

poyrazK added 2 commits May 13, 2026 15:41
- Add kLinearCountingThreshold, kBiasCorrectionBoundary, kBiasAdjustmentFactor
  constants to replace magic numbers in cardinality() formula
- Fix missing newline at EOF in hll.hpp and hll_test.cpp
- Add AccuracyBoundsDistinct test for cardinality bounds
- Add MergeOverlappingSets test for HLL merge behavior
- Add SeedReproducibility and DifferentSeedsDiffer tests for seed behavior
Tests INT64, BIGINT, DOUBLE, and TEXT value types using the
same integration path as execute_analyze() — Value::Hash{}
for numeric types and hash_bytes() for text.
@poyrazK poyrazK force-pushed the feature/hll-ndv branch from 6482641 to 8cca192 Compare May 14, 2026 13:15

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/hll_test.cpp (1)

54-63: 💤 Low value

Consider extracting LCG constants for clarity.

The LCG multiplier 6364136223846793005ULL and increment 1442695043ULL are reused across multiple tests. Extracting them as named constants would improve readability and maintainability.

♻️ Proposed refactor

Add at the top of the anonymous namespace:

namespace {

// LCG constants for generating distinct test values
constexpr uint64_t kLcgMultiplier = 6364136223846793005ULL;
constexpr uint64_t kLcgIncrement = 1442695043ULL;

// Then use in tests:
val = val * kLcgMultiplier + kLcgIncrement;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/hll_test.cpp` around lines 54 - 63, Extract the LCG magic numbers into
named constexprs (e.g., add constexpr uint64_t kLcgMultiplier =
6364136223846793005ULL; and constexpr uint64_t kLcgIncrement = 1442695043ULL;
inside the anonymous namespace at the top of the test file) and replace the
literal uses in the HyperLogLogTests::DistinctValuesProduceCardinality test (and
any other tests using the same LCG pattern) so the loop uses val = val *
kLcgMultiplier + kLcgIncrement; to improve readability and reuse.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/hll_test.cpp`:
- Around line 178-200: Rename the test whose TEST macro uses the name
"HyperLogLogTests, MergeOverlappingSets" to reflect the actual behavior (the two
LCG seeds produce disjoint sets); change the TEST identifier to something like
"HyperLogLogTests, MergeDisjointSets" and update any references to this test
name; ensure the test function name in the TEST macro matches the comment and
intent (or alternatively change the data generation to produce overlapping
values if you prefer to keep the original name).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b8e0ee5-88ac-4b2f-8c69-b310bdc9e778

📥 Commits

Reviewing files that changed from the base of the PR and between 6482641 and 72dfcc3.

📒 Files selected for processing (2)
  • include/common/hll.hpp
  • tests/hll_test.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • include/common/hll.hpp

Comment thread tests/hll_test.cpp
Comment on lines +178 to +200
TEST(HyperLogLogTests, MergeOverlappingSets) {
    HyperLogLog hll1;
    HyperLogLog hll2;
    uint64_t val = 123456789ULL;
    for (int i = 0; i < 100; ++i) {
        hll1.insert(val);
        val = val * 6364136223846793005ULL + 1442695043ULL;
    }
    uint64_t val2 = 987654321ULL;
    for (int i = 0; i < 100; ++i) {
        hll2.insert(val2);
        val2 = val2 * 6364136223846793005ULL + 1442695043ULL;
    }
    uint64_t card1 = hll1.cardinality();
    uint64_t card2 = hll2.cardinality();
    hll1.merge(hll2);
    uint64_t merged = hll1.cardinality();
    // Merged cardinality should be >= either individual
    EXPECT_GE(merged, card1);
    EXPECT_GE(merged, card2);
    // Both sets are disjoint with good distribution, merged should be in a reasonable range
    EXPECT_LT(merged, 50000U); // Sanity upper bound
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test name does not match implementation.

The test is named MergeOverlappingSets but uses disjoint sets (different LCG seeds: 123456789ULL vs 987654321ULL). The comment on line 198 even states "Both sets are disjoint." Consider renaming to MergeDisjointSets or similar to avoid confusion.

📝 Suggested rename
-TEST(HyperLogLogTests, MergeOverlappingSets) {
+TEST(HyperLogLogTests, MergeDisjointSets) {

Owner Author

@poyrazK poyrazK left a comment

It's okay to merge

@poyrazK poyrazK merged commit f2fae18 into main May 14, 2026
1 check passed